ABSTRACT

This study defined and validated a new set of dimensions, new 
anchoring descriptions, and a new rubric format for assessing participation in 
collaboration. One strand of the research explored the use of analog video
technology to conduct summative assessment of collaborative inquiry. The
second strand of the research explored the use of digital video technology to
conduct formative assessment of collaborative inquiry. Participants were from 
seven middle school classrooms taught by two teachers at two schools. Students 
in all classrooms were asked to complete a brief genetics performance 
assessment, and 42 student triads were videotaped for the 2 research strands. 
In the first strand, five graduate students evaluated the collaborations. It 
appeared that the summative assessment practice attained a level of precision 
sufficient for comparing groups of students to each other, although it did not 
appear that this approach is likely to yield the precision needed by any 
formal accountability system. In the second strand, triads of students (15 
sets) were asked to review their own assessment tapes and then score their own 
collaboration using a scale developed for the purpose. This approach does 
appear to be a promising method of enhancing participation in collaboration 
and increasing students' ability to engage in collaborative learning that 
might be implemented on a larger scale. (Contains 4 tables and 29 references.)
This project used diverse views to define and validate ways to assess and promote
collaboration and participation in collaborative inquiry in science. The value of such an effort is 
illustrated by the many educational standards documents that highlight the value of the ability to 
work with others (e.g., Kendall & Marzano, 1997). Haertel and Means’ (2000) review of current 
educational research methodology highlighted the need for common instruments and methods for
assessing valued learning outcomes. The ascendance of standards-oriented reforms highlights the 
value of characterizing outcomes in a manner that can be readily communicated to diverse 
stakeholders while simultaneously directing useful educative activity. Haertel and Means also
highlighted the need for new measures that reflect contemporary views on knowing and learning, 
particularly for documenting outcomes promised by technology-supported innovations. While all 
educational standards acknowledge the importance of collaboration, this elusive construct 
confounds conventional assessment and evaluation practices. Particularly in the current 
accountability-oriented climate, educational outcomes that cannot be readily measured and 
communicated to diverse stakeholders are likely to be overlooked. 

This effort built on the PI’s prior attempts to evaluate the impact of GA Tech’s Learning by
Design middle school science curriculum (Kolodner et al., 1998, in review). In one part of this
evaluation, over 150 teams in LBD and comparison classrooms were videotaped while 
collaboratively completing a performance assessment obtained from the Performance Assessment 
Links in Science (PALS, 2001) website. These videotapes were scored using nine dimensions of 
collaboration advanced in prior research by Pomplun (1996). These scores, along with the scores of 
the students’ performance assessments, were analyzed and reported as evidence of the effectiveness 
of the LBD curriculum (Hickey, 1999; 2000b; an extended discussion of the theoretical and 
practical issues in this approach is presented in Hickey, 2001).



1 The project was supported by a Seed Grant from the Center for Innovative Learning Technologies, which is 
supported by the National Science Foundation. We gratefully acknowledge the participation of the students 
and teachers in the two classrooms where this research was conducted. We also acknowledge the
participation of the following collaborators on this effort: Simona Laprocina, Paula Schwanenflugel,
Elizabeth Meisinger (University of Georgia); Jennifer Holbrook & Jackie Gray (GA Tech); Joy Mordica,
Nancy Schafer (GA State), Juan Balderas & Vicki Winn, Ogden County Schools; Dan Dunlap, VA Tech; 
Richard A. Duschl, King's College, London; Sanna Jarvela, University of Oulu, Finland; Steven McGee, 
NASA Classroom of the Future; Sharon Nelson-Barber, WestEd; Bill Penuel, SRI International; and Steven 
Tanimoto, University of Washington. For more information, contact Daniel T. Hickey, 611 Aderhold Hall, 
Athens, GA 30602, dhickey@coe.uga.edu. 
The present project extended and refined this assessment practice. Specifically, we defined
and validated a new set of dimensions, new anchoring descriptions, and a new rubric format for
assessing participation in collaboration, and attempted to use these to help learners evaluate and
improve their own collaboration. Thus, one strand of our research explored the use of analog video
technology to conduct summative assessment of collaborative inquiry. The second strand of our
research explored the use of digital video technology to conduct formative assessment of 
collaborative inquiry. The combination of these two strands in a single effort directly addresses 
contemporary concerns with the relationship between validity and value among educational 
researchers. Following from Frederiksen and Collins (1989), we explore the relationship between 
evidential and consequential validity (Hickey, Wolfe, & Kindfield, 2000) relative to collaborative 
inquiry. Specifically we considered whether a marginally reliable summative assessment of 
collaborative inquiry can still be valid because it supports learning and communicates value to 
learners, educators, and policy makers. 

Our efforts build on contemporary views of learning (Bransford, Brown, & Cocking, 1999) 
and formative assessment (e.g., Black & Wiliam, 1998; Graue, 1993; Tunstall & Gipps, 1996). We
are searching for modest, scalable assessment practices that motivate learners to engage in
effective “assessment conversations” (Duschl & Gitomer, 1997) that promise to dramatically 
enhance student learning. This activity is characterized by authentic scientific argumentation (e.g., 
Driver, Newton, & Osborne, 2000) in which students are making and warranting knowledge claims 
based on evidence and theory (e.g., Jimenez-Aleixandre, Rodriguez, & Duschl, 2000). 

Along the way we grappled with some of the issues that contemporary sociocultural views 
of knowing and learning (e.g., Greeno, 1998; Wenger, 1998) present when considering what it 
means to define, assess, and promote collaboration and collaborative learning. Divergent 
perspectives on learning yield remarkably divergent conceptualizations of collaboration. Our use of 
collaboration and participation acknowledges the tension between the conventional notion of
individuals acquiring domain-general skills and dispositions that support collaboration, and
contemporary sociocultural notions of shared inquiry-oriented practices becoming ritualized within 
particular communities of learners. A fundamental assumption of this project is that a single 
definition of “collaboration” that is broadly meaningful to learners, educators, researchers, and 
policy makers will be valuable for fostering collaboration and the ability to collaborate in our 
schools. The project was deliberately set up to present the conflicts and contradictions that need to 
be negotiated in order to meet this admittedly idealistic long-term goal. 

METHOD 

The participants in this study were from seven classrooms taught by two teachers at two 
schools that previously participated in the GenScope Assessment Project (Hickey, 2000a). These 
schools were in the same school district but served very different student populations. One school 
served a relatively lower-SES suburban community. Over 30% of the students at this school had 
qualified for the federal lunch subsidy, and nearly every student (99.5%) was African American.
The school typically posted average achievement scores that were below the national average, but
higher than most of the other schools in this district that also served predominantly African 
American students. State data showed that 61% of these students passed the science component of 
the high school graduation test on their first attempt. The other was a high SES suburban school 
where 12% of the students were non-white, 1.5% of the students received subsidized lunch, and 
95% of the students passed the science graduation test on their first attempt. Data from the 
GenScope Assessment Project showed substantial disparity in genetics knowledge between the two 
sets of classrooms. As in the previous GenScope evaluations (e.g., Hickey, Kindfield, Horwitz, & 
Christie, 1999), mean proficiency in the lower-SES classrooms after genetics instruction was lower 
than the mean performance in the higher SES classrooms before instruction. Specifically,
performance on conventional genetics content measures and performance assessments a few weeks
prior to the first data collection in this study differed by at least 1.5 SD. In homogeneous populations
in our research, this difference equals roughly 2 grade levels. 
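
A standardized difference of this kind can be illustrated with a minimal sketch. The report does not state how the SD unit was computed, so the pooled-SD formula below is an assumption, and the scores are hypothetical values chosen only to show the arithmetic.

```python
from math import sqrt
from statistics import mean, stdev


def standardized_difference(group_a, group_b):
    """Difference in means divided by the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * stdev(group_a) ** 2
                  + (n_b - 1) * stdev(group_b) ** 2) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical scores (NOT data from this study).
higher_ses = [14, 16, 18, 20, 22]
lower_ses = [8, 10, 12, 14, 16]

d = standardized_difference(higher_ses, lower_ses)
print(f"Standardized difference = {d:.2f} SD")
```

A value at or above 1.5 on this index would correspond to the size of gap described above.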

Students in all seven classrooms were asked to complete a brief (about 30 minute) genetics 
performance assessment (called Human Inheritance) obtained from the PALS website. Forty-two
of these triads were videotaped (6 per classroom) with 8mm cameras equipped with wide-angle 
lenses and tabletop microphones. These tapes were then used in two different strands of research, 
described next. 

Summative Assessment of Collaborative Inquiry 

The first strand of this effort was refining a set of dimensions and corresponding practice for 
assessing collaborative inquiry. Five graduate students used ten of the videotapes of collaboration 
to interpretively and empirically analyze 29 candidate dimensions following from Pomplun (1996), 
Gray (2001), and Jimenez-Aleixandre, Rodriguez, & Duschl (2000). They were asked to identify a 
small subset that reflected the perceived needs of researchers, educators, and policy makers. They 
were specifically instructed to exclude dimensions that presented concerns about equity or cultural 
bias, and to include dimensions that reflected contemporary situative perspectives that emphasize 
how domain knowledge is constructed within and partially bound to ritualized collaborative inquiry. 

These five students spent roughly 20 hours working together during five separate meetings. 
This effort was further informed by a half-day workshop that occurred about halfway through the 
process, conducted at AERA 2001. The PI and the five graduate students along with ten outside 
researchers (Dunlap, Duschl, Gray, Holbrook, Jarvela, Nelson-Barber, McGee, Penuel, Ravitz, 
Tanimoto) discussed the various dimensions, equity issues, measurement issues, and policy issues. 
This meeting lent general support to the importance of the effort, the selected dimensions, and the 
validity of the approach, and laid the groundwork for subsequent collaboration. The effort of the 
five graduate students resulted in the six dimensions shown in Table 1.

In order to assess whether or not it was possible to score the six final dimensions reliably, 
three of the graduate students (Hand, Kyser, & Laprocina) independently scored 23 of the remaining
videotapes according to the six dimensions in Table 1. Scorer 1 scored all 23 of the tapes while
scorers 2 and 3 scored 10 and 13 tapes, respectively. As in the earlier effort using the Pomplun 
dimensions, inter-rater reliabilities were disappointingly low. The correlation of the summed scores 
between the first and second raters on 13 tapes was only .74, while the correlation of the summed 
scores between the first and third raters on 10 tapes was .56. Examining the correlations for each of 
the six scales showed a mixed pattern of divergence across the scales, with scorer 1 and scorer 2 
diverging the most on Scale 3, and scorer 1 and scorer 3 diverging most on Scale 4 and Scale 5. 
Particularly given that these correlations do not include a correction for chance (i.e., Cohen’s κ),
our observed reliabilities are problematic from a conventional measurement perspective. Of course,
correlations based on such small numbers are fickle, and further training of scorers is likely to 
increase inter-rater reliability. Nonetheless, the present effort represented a substantial, coordinated 
effort on the part of motivated, thoughtful graduate students. Coupled with the modest reliabilities 
obtained previously, it seems that reliable summative assessment of collaborative inquiry via the 
present method will continue to be problematic. 
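
The distinction between a correlation and a chance-corrected agreement index can be made concrete with a minimal sketch. The ratings below are hypothetical (they are not data from this study), the 1-4 scale is an assumption, and the functions are generic textbook formulas rather than the procedure used by the scorers.

```python
from collections import Counter
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def cohens_kappa(r1, r2):
    """Cohen's kappa: exact agreement between two raters, corrected for chance."""
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement: product of the raters' marginal proportions per category.
    expected = sum(c1[k] * c2[k] for k in set(r1) | set(r2)) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical ratings (NOT data from this study): two raters, ten tapes,
# one dimension scored on an assumed 1-4 scale. Rater B is consistently
# more lenient than rater A.
rater_a = [1, 2, 2, 3, 3, 3, 4, 4, 2, 1]
rater_b = [2, 3, 3, 4, 4, 4, 4, 4, 3, 2]

r = pearson_r(rater_a, rater_b)
kappa = cohens_kappa(rater_a, rater_b)
print(f"Pearson r = {r:.2f}, Cohen's kappa = {kappa:.2f}")
```

In this constructed example the two raters rank the tapes almost identically, so the correlation is high, yet they rarely assign the same score, so kappa falls near zero. This is why correlations alone can overstate scoring agreement.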

Equity. One of the issues we struggled with throughout this process concerned equity. We 
were concerned that culturally specific styles of interaction might bias scores. Specifically, we 
were concerned that interaction styles that are culturally appropriate in a racial minority community 
might be viewed as disrespectful or uncooperative by observers from a mainstream culture. We 
discussed this issue at length while defining the dimensions, and relied a great deal on the input 
from the two African American participants (Joy Mordica, who completed a term paper on the topic
for a course on the psychology of inner city learners taught by Asa Hilliard at GSU, and Jessica 
DeCuir, whose UGA Ed Psych dissertation topic is African American Identity). Additional insights 
were provided at the AERA meeting by Sharon Nelson-Barber, a leading expert in the area of 
cultural validity and equity in assessment. 

A preliminary review of the videotapes scored in the summative assessment revealed 
interesting examples of African American students effectively using African American Vernacular 
English (AAVE), a dialect of English, to successfully negotiate shared understanding of content and 
collaboration standards. It seems likely that many scorers who are unfamiliar with AAVE would be 
inclined to give these examples poor marks on some dimensions (e.g., Collaborative Participation) 
because of their difficulty discerning this dialect or possibly because of racial biases. Our 
expectation is that including the Warranting Knowledge Claims with Data and Warranting 
Knowledge Claims with Theory dimensions would offset this problem by both giving scorers another
dimension that would be scored more highly, and by reminding scorers to try to “look past” cultural 
biases on the other more domain-general dimensions. We had hoped to explore these issues in more 
interpretive and empirical detail. However, neither of the African American graduate students who 
participated in the original scale definition participated in the subsequent scoring of the 23 
additional tapes, and the other scorers were not familiar enough with AAVE or the cultural norms of
African American secondary students to represent that perspective in the effort. 

Summary. In summary, we believe that our summative assessment practice attained a level 
of precision sufficient for comparing groups of students to each other. For example, it seems that 
this practice is a valid and appropriate way of comparing students in two different curricular 
environments in the same domain (as in the LBD program evaluation). This approach appears 
equally appropriate for a wide range of students who might be assessed in western nations. 
Additional work is needed to reduce unexplained variance before making more precise 
interpretations, such as the differences between two teachers implementing the same curriculum. In 
particular, it seems that more reliable scores would have been obtained more quickly had we spent 
more time identifying benchmark examples to train scorers. Better yet, we might consider having 
students dramatize examples to clearly illustrate what different levels of each of the dimensions 
look like. 

These improvements aside, the results of this project and the prior effort suggest that this
approach to summative assessment of collaboration is unlikely to ever yield the precision demanded 
by any formal accountability system. Indeed, as a purely summative assessment, it seems that our 
model of assessment practice actually has rather limited value. 

Formative Assessment of Collaborative Inquiry 

The second strand of our effort follows from Frederiksen and Collins’ (1989) notion of
“systemic validity”: 

A systemically valid test is one that induces in the educational system curricular and 
instructional changes that foster the development of the cognitive skills that the test is 
designed to measure. Evidence for systemic validity would be an improvement in those 
skills after the test has been in place within the educational system for a period of time
(p. 27).

From this perspective, a systemically valid collaboration assessment practice directly enhances 
student collaboration; this contrasts with the conventional view that assessment indirectly supports 
learning by identifying more or less effective programs. A systemically valid collaboration 
